Predicting Aviation Accident Severity: A Data-Driven Approach to Enhancing Air Travel Safety¶

Done by:¶

- Pramod Kumar (UID:121032911)¶

- Swapnita Sahu (UID:121292223)¶

Table of Contents¶

  1. Introduction
  2. Importing Necessary Libraries
  3. Data Collection
  4. Downloading a dataset from Kaggle
  5. Data Import
  6. Data Cleaning
  7. Exploratory Data Analysis
    • Outlier Analysis
    • Distributions of Variables
    • Handling Skewness
    • Feature Correlations
  8. Bivariate Analysis
  9. Feature Scaling
  10. Model Building
    • Feature Selection
    • Train-Test Split
  11. Model Training
    • Baseline Models
    • Neural Network
  12. Model Evaluation
    • Performance Metrics
    • Confusion Matrix
  13. Conclusion

Introduction¶

The aviation industry is a global powerhouse, with millions of flights operated each year. Although air travel is one of the safest modes of transportation, the stakes remain high when accidents do occur. A single crash can ripple through economies, influence regulatory policy, and shift public perception of air travel. Understanding and predicting the severity of these crashes is crucial both for enhancing safety protocols and for minimizing the impact on passengers, crews, and the wider community. As the industry continues to evolve, with new aircraft technologies and operational protocols, a data-driven approach to understanding potential crash outcomes becomes even more important.

Recent events have underscored the importance of this issue. For instance, the tragic crash of a passenger plane in early 2024 raised questions about existing safety measures and the effectiveness of current predictive models. Investigations revealed that while initial crash predictions indicated low risk, unforeseen factors led to a disastrous outcome. Such incidents highlight the necessity for more robust predictive frameworks that can analyze various parameters—including weather conditions, human factors, and aircraft maintenance history—to provide more accurate severity assessments.

This project sits at the intersection of machine learning and real-world applications, showcasing the transformative power of data science in critical fields. By leveraging advanced algorithms and big data analytics, we can derive insights from vast amounts of historical flight data, accident reports, and environmental conditions. As we embark on this project, we aim not only to contribute to the body of knowledge in aviation safety but also to illustrate how data science can drive meaningful change in sectors that affect our daily lives. By harnessing the power of machine learning, we strive to create a safer future for air travel, ensuring that lessons learned from past incidents lead to actionable insights that protect lives.

Importing Necessary Libraries¶

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
import os
from scipy import stats

from google.colab import files  # only available in Google Colab

import warnings
warnings.filterwarnings(action= 'ignore')

To get started with this tutorial, first import the essential Python libraries, as shown above. They are used throughout the tutorial. The code is written for a notebook environment; the google.colab import and the /content/ file paths used later assume Google Colab, so adjust them when running in a local Jupyter Notebook.

A key library we’ll work with is Pandas, a powerful open-source tool for data analysis built on Python. It offers a user-friendly and flexible approach to data manipulation, allowing us to perform a variety of transformations effortlessly.

Another critical library we'll utilize is NumPy, which is designed for high-performance computations on large datasets. It provides a robust framework for storing, processing, and performing complex operations on data, streamlining the analysis process.

Setting a common style to visualize plots¶

In [83]:
sns.set_theme(style="whitegrid", palette="deep")

Data Collection¶

The Airplane Accidents Severity Dataset on Kaggle provides detailed information on airplane accidents that occurred between 2010 and 2018. It consists of two CSV files: "train.csv" and "test.csv". As loaded in this tutorial, the training dataset contains 10,000 rows and 12 columns, while the testing dataset contains 2,500 rows and 11 columns (it has no Severity column). Each row corresponds to a unique airplane accident. The dataset description lists the following columns:

  • Accident_ID: A unique identifier assigned to each accident.
  • Accident_Type_Code: A numerical code indicating the type of accident (e.g., "1" for "Controlled Flight Into Terrain," "2" for "Loss of Control In Flight," etc.).
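  • Cabin_Temperature: The cabin temperature at the time of the accident. The dataset description gives the unit as degrees Celsius, although the recorded values (around 80) look more like degrees Fahrenheit.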
  • Turbulence_In_gforces: The g-force experienced by the aircraft during the incident.
  • Control_Metric: A measure of the pilot's ability to maintain control during the accident.
  • Total_Safety_Complaints: The total number of safety complaints filed against the airline in the 12 months leading up to the accident.
  • Days_Since_Inspection: The number of days since the aircraft's last inspection.
  • Safety_Score: A metric that evaluates the overall safety performance of the airline.
  • Severity: The severity level of the accident, categorized as "Minor_Damage_And_Injuries," "Significant_Damage_And_Fatalities," "Significant_Damage_And_Serious_Injuries," or "Highly_Fatal_And_Damaging."
  • Accident_Type_Description: A detailed description of the type of accident (not present in the CSV files loaded below).
  • Max_Elevation: The highest altitude achieved by the aircraft during the flight.
  • Violations: The number of safety violations recorded for the airline in the 12 months preceding the accident.
  • Adverse_Weather_Metric: A metric assessing weather conditions during the time of the accident.

Downloading a dataset from Kaggle¶

  • Create an API token in your Kaggle account settings.
  • Download the token as kaggle.json.
  • Upload kaggle.json to Google Colab.
  • Create a .kaggle directory in your home directory and move kaggle.json into it.
  • Restrict the file's permissions so that only you can read it (for example, chmod 600).
  • Download the dataset.
  • Unzip it and start using it.
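
The token-handling steps above can be sketched in Python. This is a minimal sketch, assuming the token was saved as kaggle.json in the working directory. The helper name install_kaggle_token is hypothetical, and the dataset slug in the closing comments is a placeholder rather than the actual identifier.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict its permissions.

    install_kaggle_token is a hypothetical helper written for this tutorial.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)

    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)

    # Owner read/write only (equivalent to chmod 600)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target


# In Colab, after uploading kaggle.json to the working directory:
# install_kaggle_token("kaggle.json")
# !kaggle datasets download -d <owner>/<dataset-slug>   # placeholder slug
# !unzip -o <dataset-slug>.zip -d /content
```

Passing home_dir explicitly makes it possible to try the helper in a scratch directory before touching the real home directory.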

Data Import¶

In [84]:
df_train = pd.read_csv('/content/train.csv')
df_train.head()
Out[84]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
0 Minor_Damage_And_Injuries 49.223744 14 22 71.285324 0.272118 78.04 2 31335.47682 3 0.424352 7570.0
1 Minor_Damage_And_Injuries 62.465753 10 27 72.288058 0.423939 84.54 2 26024.71106 2 0.352350 12128.0
2 Significant_Damage_And_Fatalities 63.059361 13 16 66.362808 0.322604 78.86 7 39269.05393 3 0.003364 2181.0
3 Significant_Damage_And_Serious_Injuries 48.082192 11 9 74.703737 0.337029 81.79 3 42771.49920 1 0.211728 5946.0
4 Significant_Damage_And_Fatalities 26.484018 13 25 47.948952 0.541140 77.16 3 35509.22852 2 0.176883 9054.0
In [85]:
df_test = pd.read_csv('/content/test.csv')
df_test.head()
Out[85]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
0 19.497717 16 6 72.151322 0.388959 78.32 4 37949.724386 2 0.069692 1
1 58.173516 15 3 64.585232 0.250841 78.60 7 30194.805567 2 0.002777 10
2 33.287671 15 3 64.721969 0.336669 86.96 6 17572.925484 1 0.004316 14
3 3.287671 21 5 66.362808 0.421775 80.86 3 40209.186341 2 0.199990 17
4 10.867580 18 2 56.107566 0.313228 79.22 2 35495.525408 2 0.483696 21
In [86]:
df_train.shape
Out[86]:
(10000, 12)
In [87]:
df_train.describe()
Out[87]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.00000 10000.000000 9990.000000
mean 41.775112 12.931100 6.564300 65.039961 0.381495 79.808721 3.814900 32001.803282 2.01220 0.255635 6266.962162
std 16.442025 3.539803 6.971982 12.381440 0.121301 4.512422 1.902577 9431.995196 1.03998 0.381128 3610.023741
min -99.000000 1.000000 0.000000 -86.000000 0.134000 0.000000 1.000000 831.695553 0.00000 0.000316 2.000000
25% 30.502283 11.000000 2.000000 56.927985 0.293665 77.950000 2.000000 25757.636910 1.00000 0.012063 3140.500000
50% 41.187215 13.000000 4.000000 65.587967 0.365879 79.530000 4.000000 32060.336420 2.00000 0.074467 6282.500000
75% 52.511416 15.000000 9.000000 73.336372 0.451346 81.550000 5.000000 38380.641515 3.00000 0.354059 9390.750000
max 100.000000 23.000000 54.000000 100.000000 0.882648 97.510000 7.000000 64297.651220 5.00000 2.365378 12500.000000
In [88]:
df_test.shape
Out[88]:
(2500, 11)
In [89]:
df_test.describe()
Out[89]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
count 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000
mean 41.825224 12.946400 6.574800 65.368058 0.376197 79.993068 3.853600 32383.134179 1.990800 0.250886 6186.283200
std 16.280187 3.523364 7.179542 11.442005 0.116960 2.713833 1.877652 9485.096436 1.018592 0.387663 3602.235035
min 0.000000 1.000000 0.000000 20.966272 0.143376 74.740000 1.000000 831.695553 0.000000 0.000368 1.000000
25% 30.593607 11.000000 1.000000 57.702826 0.292583 77.930000 2.000000 26008.851717 1.000000 0.013136 3071.750000
50% 41.461187 13.000000 4.000000 66.066545 0.357404 79.600000 4.000000 32472.865497 2.000000 0.072466 6159.500000
75% 52.751142 15.000000 9.000000 73.119872 0.441699 81.530000 5.000000 38759.519071 3.000000 0.315407 9309.250000
max 100.000000 23.000000 54.000000 97.994531 0.881926 94.200000 7.000000 62315.408444 5.000000 2.365378 12493.000000

Data Cleaning¶

  • The code given below checks for missing values in the training and testing datasets by summing the null entries for each column.
  • It helps identify which columns have missing values and their extent.
In [90]:
df_train.isnull().sum()
Out[90]:
0
Severity 0
Safety_Score 0
Days_Since_Inspection 0
Total_Safety_Complaints 0
Control_Metric 0
Turbulence_In_gforces 0
Cabin_Temperature 0
Accident_Type_Code 0
Max_Elevation 0
Violations 0
Adverse_Weather_Metric 0
Accident_ID 10

In [91]:
df_test.isnull().sum()
Out[91]:
0
Safety_Score 0
Days_Since_Inspection 0
Total_Safety_Complaints 0
Control_Metric 0
Turbulence_In_gforces 0
Cabin_Temperature 0
Accident_Type_Code 0
Max_Elevation 0
Violations 0
Adverse_Weather_Metric 0
Accident_ID 0

  • The Accident_ID column has 10 missing values. Considering the dataset contains 10,000 records, removing these 10 rows is a reasonable approach. This represents only 0.1% of the data, and the impact on the analysis or model performance will be negligible.

  • Dropping these rows leaves the training set without missing values. The next cells also create testing2, a copy of the test set without the Accident_Type_Code column; note that the training data keeps Accident_Type_Code, so the two sets must be brought to the same columns before a model is applied to the test data.

In [92]:
# Removing rows where Accident_ID is null
df_train = df_train.dropna(subset=['Accident_ID'])

# Verifying the changes
print(df_train.info())
<class 'pandas.core.frame.DataFrame'>
Index: 9990 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Severity                 9990 non-null   object 
 1   Safety_Score             9990 non-null   float64
 2   Days_Since_Inspection    9990 non-null   int64  
 3   Total_Safety_Complaints  9990 non-null   int64  
 4   Control_Metric           9990 non-null   float64
 5   Turbulence_In_gforces    9990 non-null   float64
 6   Cabin_Temperature        9990 non-null   float64
 7   Accident_Type_Code       9990 non-null   int64  
 8   Max_Elevation            9990 non-null   float64
 9   Violations               9990 non-null   int64  
 10  Adverse_Weather_Metric   9990 non-null   float64
 11  Accident_ID              9990 non-null   float64
dtypes: float64(7), int64(4), object(1)
memory usage: 1014.6+ KB
None
In [93]:
testing2= df_test.drop(['Accident_Type_Code'], axis=1)
testing2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Safety_Score             2500 non-null   float64
 1   Days_Since_Inspection    2500 non-null   int64  
 2   Total_Safety_Complaints  2500 non-null   int64  
 3   Control_Metric           2500 non-null   float64
 4   Turbulence_In_gforces    2500 non-null   float64
 5   Cabin_Temperature        2500 non-null   float64
 6   Max_Elevation            2500 non-null   float64
 7   Violations               2500 non-null   int64  
 8   Adverse_Weather_Metric   2500 non-null   float64
 9   Accident_ID              2500 non-null   int64  
dtypes: float64(6), int64(4)
memory usage: 195.4 KB
In [94]:
#Print their respective shapes
print("Shape of training data is:", df_train.shape)
print("Shape of testing data is:", testing2.shape)
Shape of training data is: (9990, 12)
Shape of testing data is: (2500, 10)
In [95]:
df_train['Severity'].value_counts()
Out[95]:
count
Severity
Highly_Fatal_And_Damaging 3036
Significant_Damage_And_Serious_Injuries 2707
Minor_Damage_And_Injuries 2514
Significant_Damage_And_Fatalities 1688
Significant_Damge_And_Serious_Injuries 9
Minor_Damage_And_Injry 6
Highly_Fatl_And_Damaging 4
Highly_Fatal_And_Damagin 4
Minor_Damage_And_Injuries 4
Significant_Damage_And_Serious_Injry 4
Significant_Dmg_And_Fatalities 3
Highly_Fatal_And_Dmg 3
Sigificant_Damage_And_Serious_Injuries 3
Minor_Damge_And_Injuries 3
Sigificant_Damage_And_Fatalities 2

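We count the occurrences of each unique value in the Severity column to understand its distribution. The output shows the four expected categories along with several misspelled variants (for example, Highly_Fatl_And_Damaging), which need to be corrected.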

In [96]:
!pip install fuzzywuzzy
Requirement already satisfied: fuzzywuzzy in /usr/local/lib/python3.10/dist-packages (0.18.0)
In [97]:
from fuzzywuzzy import process

# Define the list of correct categories
valid_categories = [
    'Highly_Fatal_And_Damaging',
    'Significant_Damage_And_Serious_Injuries',
    'Minor_Damage_And_Injuries',
    'Significant_Damage_And_Fatalities'
]

# Function to match each value to the closest valid category
def match_severity(value):
    return process.extractOne(value, valid_categories)[0]

# Apply the function to the 'Severity' column
df_train['Severity'] = df_train['Severity'].apply(match_severity)

# Verify the changes
print(df_train['Severity'].value_counts())
Severity
Highly_Fatal_And_Damaging                  3047
Significant_Damage_And_Serious_Injuries    2723
Minor_Damage_And_Injuries                  2527
Significant_Damage_And_Fatalities          1693
Name: count, dtype: int64
  • The code above installs the fuzzywuzzy library, which provides fuzzy string matching.
  • Its process module finds the closest match for a string within a list of choices.
  • valid_categories lists the four valid severity categories.
  • match_severity replaces each Severity value with its closest valid category, correcting inconsistent or misspelled entries.
  • The final value_counts() output confirms that only the four valid categories remain.
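
Because match_severity takes the top-scoring candidate without checking its score, a value that resembles none of the categories would still be assigned to one of them. The sketch below shows one possible safeguard. It swaps in the standard library's difflib.get_close_matches instead of fuzzywuzzy, and the function name match_with_cutoff and the 0.8 cutoff are assumptions made for illustration.

```python
from difflib import get_close_matches

VALID_CATEGORIES = [
    "Highly_Fatal_And_Damaging",
    "Significant_Damage_And_Serious_Injuries",
    "Minor_Damage_And_Injuries",
    "Significant_Damage_And_Fatalities",
]


def match_with_cutoff(value, choices=VALID_CATEGORIES, cutoff=0.8):
    """Return the closest valid label, or None if nothing is similar enough."""
    matches = get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Two misspellings taken from the value counts above, plus an unrelated string
for raw in ["Highly_Fatl_And_Damaging", "Minor_Damage_And_Injry", "Unknown"]:
    print(raw, "->", match_with_cutoff(raw))
```

Values that come back as None can then be inspected by hand instead of being silently assigned to a category.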
In [98]:
df_train['Cabin_Temperature'].value_counts()
Out[98]:
count
Cabin_Temperature
78.46 47
80.98 43
78.37 42
79.17 41
81.26 40
... ...
81.65 1
78.07 1
85.25 1
80.10 1
85.31 1

951 rows × 1 columns


We count the frequency of each unique value in the Cabin_Temperature column.

In [99]:
# Filter rows where Cabin_Temperature is 0
cabin_temp_zero = df_train[df_train['Cabin_Temperature'] == 0]

# Display the rows with Cabin_Temperature = 0
cabin_temp_zero
Out[99]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
139 Highly_Fatal_And_Damaging 22.602740 15 15 57.976299 0.527797 0.0 4 36653.79366 0 0.067038 9293.0
356 Significant_Damage_And_Serious_Injuries 68.675799 8 2 45.396536 0.382107 0.0 7 24664.52893 2 0.002104 6770.0
1705 Highly_Fatal_And_Damaging 28.219178 13 0 83.819508 0.185208 0.0 7 32693.57769 2 0.002961 8131.0
1860 Highly_Fatal_And_Damaging 23.424658 15 18 74.749316 0.296640 0.0 3 32539.84510 2 0.162085 12118.0
2487 Minor_Damage_And_Injuries 62.648402 10 15 58.751139 0.467213 0.0 4 26782.02428 2 0.049252 4843.0
2609 Highly_Fatal_And_Damaging 22.009132 15 5 65.360073 0.342799 0.0 1 26397.20312 2 0.973443 7254.0
3087 Highly_Fatal_And_Damaging 43.652968 9 8 73.473108 0.311065 0.0 3 33257.59519 0 0.165002 11131.0
4315 Significant_Damage_And_Serious_Injuries 63.698630 6 4 75.387420 0.238940 0.0 3 36736.29014 3 0.183405 4681.0
4645 Highly_Fatal_And_Damaging 9.315068 18 4 56.608933 0.273199 0.0 4 33643.98877 2 0.061453 5891.0
4789 Significant_Damage_And_Serious_Injuries 29.634703 16 0 81.586144 0.433675 0.0 6 34431.91837 0 0.009106 4290.0
4827 Significant_Damage_And_Serious_Injuries 39.269406 13 1 64.175023 0.330899 0.0 2 32739.24598 2 0.444452 10314.0
5074 Significant_Damage_And_Serious_Injuries 87.488584 6 21 52.461258 0.475868 0.0 7 37447.46191 1 0.003466 59.0
6204 Minor_Damage_And_Injuries 52.465753 13 7 85.095716 0.199633 0.0 5 45783.52766 2 0.031615 10679.0
6660 Minor_Damage_And_Injuries 71.598174 7 0 57.247037 0.268872 0.0 4 22841.76341 3 0.042758 6820.0
6899 Significant_Damage_And_Fatalities 55.936073 15 4 68.960802 0.253726 0.0 7 21703.21590 2 0.001702 5113.0
7088 Minor_Damage_And_Injuries 45.433790 15 14 68.596171 0.340275 0.0 5 52015.79442 2 0.035319 5383.0
7603 Highly_Fatal_And_Damaging 33.242009 11 8 76.618049 0.371649 0.0 2 25605.09866 0 0.344927 11415.0
8599 Significant_Damage_And_Serious_Injuries 59.817352 14 20 56.153145 0.391483 0.0 7 23226.23689 3 0.002349 11388.0
9770 Highly_Fatal_And_Damaging 53.972603 12 4 56.791249 0.544025 0.0 2 42365.06106 3 0.576201 10972.0
9897 Significant_Damage_And_Fatalities 76.027397 9 6 65.132179 0.331981 0.0 7 31631.37761 2 0.002919 8755.0

The above code filters and displays rows where Cabin_Temperature is 0, which might indicate incorrect or missing data.

In [100]:
# Calculate the median of Cabin_Temperature excluding zeros
median_temp = df_train.loc[df_train['Cabin_Temperature'] != 0, 'Cabin_Temperature'].median()

# Replace all Cabin_Temperature = 0 with the median
df_train.loc[df_train['Cabin_Temperature'] == 0, 'Cabin_Temperature'] = median_temp

# Verify the changes
print(f"Median used for replacement: {median_temp}")
df_train['Cabin_Temperature'].value_counts()
Median used for replacement: 79.54
Out[100]:
count
Cabin_Temperature
78.46 47
80.98 43
78.37 42
79.17 41
81.26 40
... ...
80.10 1
85.31 1
94.20 1
83.87 1
84.49 1

950 rows × 1 columns


  • We calculate the median of the Cabin_Temperature column, excluding rows where the value is 0.
  • The median is chosen because it is less sensitive to outliers than the mean, which makes it a robust replacement value.
  • All occurrences of 0 in Cabin_Temperature are replaced with this median (79.54), so the column no longer contains invalid values while its overall distribution is largely preserved.
  • We print the replacement value and check the updated frequency distribution of Cabin_Temperature.
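
The same replacement can be wrapped in a small reusable function. The sketch below is illustrative only: replace_invalid_with_median is a hypothetical helper, and the toy frames stand in for the real data. It also shows one common design choice, reusing the median learned on the training data when a second dataset is cleaned.

```python
import pandas as pd


def replace_invalid_with_median(df, column, invalid_value=0, median=None):
    """Return a cleaned copy of df and the median that was used.

    If median is None, it is computed from the valid rows of df itself.
    """
    out = df.copy()
    is_invalid = out[column] == invalid_value
    if median is None:
        median = out.loc[~is_invalid, column].median()
    out.loc[is_invalid, column] = median
    return out, median


# Toy data standing in for the training and test sets
toy_train = pd.DataFrame({"Cabin_Temperature": [78.0, 0.0, 80.0, 82.0, 0.0]})
toy_test = pd.DataFrame({"Cabin_Temperature": [0.0, 79.0]})

clean_train, train_median = replace_invalid_with_median(toy_train, "Cabin_Temperature")
# Reuse the training median so the second frame is cleaned consistently
clean_test, _ = replace_invalid_with_median(
    toy_test, "Cabin_Temperature", median=train_median
)
print(train_median)
print(clean_train["Cabin_Temperature"].tolist())
```

Returning a copy leaves the input frame untouched, which makes the step easy to re-run in a notebook.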
In [101]:
df_train= df_train.drop(['Accident_ID'], axis=1)
df_train.head()
Out[101]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric
0 Minor_Damage_And_Injuries 49.223744 14 22 71.285324 0.272118 78.04 2 31335.47682 3 0.424352
1 Minor_Damage_And_Injuries 62.465753 10 27 72.288058 0.423939 84.54 2 26024.71106 2 0.352350
2 Significant_Damage_And_Fatalities 63.059361 13 16 66.362808 0.322604 78.86 7 39269.05393 3 0.003364
3 Significant_Damage_And_Serious_Injuries 48.082192 11 9 74.703737 0.337029 81.79 3 42771.49920 1 0.211728
4 Significant_Damage_And_Fatalities 26.484018 13 25 47.948952 0.541140 77.16 3 35509.22852 2 0.176883

We removed the Accident_ID column from the training dataset.

  • Accident_ID: A unique identifier that does not contribute directly to the analysis or model building.

Exploratory Data Analysis¶

With missing IDs, misspelled labels, and zero cabin temperatures handled, the dataset is ready for exploratory analysis.

Outlier Analysis¶

In [102]:
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare numerical data for boxplots
num_df = df_train[['Safety_Score', 'Control_Metric', 'Turbulence_In_gforces',
                   'Cabin_Temperature', 'Max_Elevation', 'Adverse_Weather_Metric',
                   'Total_Safety_Complaints']].copy()  # explicit copy: later transformations leave df_train untouched

# Set the number of rows and columns for the grid
num_cols = 2  # 2 boxplots per row
num_plots = len(num_df.columns)
rows = (num_plots + num_cols - 1) // num_cols  # Calculate required rows

# Create the figure and axes
fig, axes = plt.subplots(rows, num_cols, figsize=(14, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten axes array for easier indexing

# Plot each variable as a boxplot
for i, col in enumerate(num_df.columns):
    sns.boxplot(data=num_df[col], color='skyblue', width=0.6, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}', fontsize=14, color='black', pad=10)  # Title
    axes[i].set_xlabel(col, fontsize=12, color='black', labelpad=10)  # X-axis label
    axes[i].set_ylabel('Value', fontsize=12, color='black', labelpad=10)  # Y-axis label
    axes[i].grid(visible=True, color='gray', linestyle='--', linewidth=0.5, alpha=0.6)  # Grid styling

# Hide any unused subplots
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to fit everything nicely
plt.tight_layout()

# Show the plot
plt.show()
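[Figure: boxplots of Safety_Score, Control_Metric, Turbulence_In_gforces, Cabin_Temperature, Max_Elevation, Adverse_Weather_Metric, and Total_Safety_Complaints]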

The primary focus here is to visualize the distribution and detect potential outliers for key numerical variables using boxplots. Boxplots are particularly useful for summarizing the range, interquartile range, median, and identifying outliers in data.

This approach provides a comprehensive visual analysis of numerical data, enabling:

  • Identification of outliers that may skew the analysis or modeling.
  • Comparison of distributions across different features.
  • Quick insights into the data's structure and variability.

As the boxplots above show, the data contains many outliers, especially in Total_Safety_Complaints, Adverse_Weather_Metric, and Turbulence_In_gforces. Removing them would discard a large share of the data, so we instead check whether transforming these variables improves the situation.
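
To put numbers on what the boxplots show, the points drawn beyond the whiskers can be counted directly. The sketch below uses the 1.5 × IQR rule, which is the default whisker length in matplotlib and seaborn boxplots. The function name iqr_outlier_counts and the toy data are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < lower) | (numeric > upper)
    return outside.sum().sort_values(ascending=False)


# Toy data: column "a" has one extreme value, column "b" has none
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
print(iqr_outlier_counts(toy))
```

Calling iqr_outlier_counts(num_df) would report, per variable, how many rows would be lost if outliers were dropped.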

In [103]:
# Map the dependent variable to categorical codes (stored as strings)
severity_codes = {
    'Minor_Damage_And_Injuries': '1',
    'Significant_Damage_And_Fatalities': '2',
    'Significant_Damage_And_Serious_Injuries': '3',
    'Highly_Fatal_And_Damaging': '4',
}
df_train['Severity'] = df_train['Severity'].map(severity_codes)
df_train.head()
Out[103]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric
0 1 49.223744 14 22 71.285324 0.272118 78.04 2 31335.47682 3 0.424352
1 1 62.465753 10 27 72.288058 0.423939 84.54 2 26024.71106 2 0.352350
2 2 63.059361 13 16 66.362808 0.322604 78.86 7 39269.05393 3 0.003364
3 3 48.082192 11 9 74.703737 0.337029 81.79 3 42771.49920 1 0.211728
4 2 26.484018 13 25 47.948952 0.541140 77.16 3 35509.22852 2 0.176883

The dependent variable Severity is mapped to short categorical codes to streamline analysis and ensure consistent representation. The mapping is as follows:

  • 'Minor_Damage_And_Injuries' → '1'
  • 'Significant_Damage_And_Fatalities' → '2'
  • 'Significant_Damage_And_Serious_Injuries' → '3'
  • 'Highly_Fatal_And_Damaging' → '4'

This transformation converts descriptive labels into compact codes, simplifying visualization and modeling tasks. Note that the codes are stored as strings ('1' to '4'), not integers, and should be read as category labels.
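
Series.map returns NaN for any label that is missing from the dictionary, so a misspelled category that slipped through cleaning would silently become a missing value. The sketch below is a variant of the mapping, not the notebook's own code: it uses integer codes instead of strings and raises an error on unknown labels. The names SEVERITY_CODES and encode_severity are hypothetical.

```python
import pandas as pd

SEVERITY_CODES = {
    "Minor_Damage_And_Injuries": 1,
    "Significant_Damage_And_Fatalities": 2,
    "Significant_Damage_And_Serious_Injuries": 3,
    "Highly_Fatal_And_Damaging": 4,
}


def encode_severity(labels):
    """Map severity labels to integer codes, failing loudly on unknown labels."""
    codes = labels.map(SEVERITY_CODES)
    unknown = labels[codes.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped severity labels: {list(unknown)}")
    return codes.astype("int64")


example = pd.Series(["Minor_Damage_And_Injuries", "Highly_Fatal_And_Damaging"])
print(encode_severity(example).tolist())
```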

In [104]:
# Importing necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Define the figure and axes with a specific size
fig, ax = plt.subplots(figsize=(12, 6))

# Create the count plot
sns.countplot(
    data=df_train,
    x='Severity',
    palette='coolwarm',
    order=df_train['Severity'].value_counts().index,  # Sort by frequency
    saturation=0.8,
    ax=ax  # Use the defined axis
)

# Add a title and labels
ax.set_title('Distribution of Severity', fontsize=16, fontweight='bold', pad=15, color='black')
ax.set_xlabel('Severity', fontsize=12, labelpad=10, color='black')
ax.set_ylabel('Count', fontsize=12, labelpad=10, color='black')

# Rotate x-axis labels for better readability
ax.tick_params(axis='x', labelrotation=45, labelsize=10)
ax.tick_params(axis='y', labelsize=10)

# Adjust layout to remove extra space
fig.tight_layout()

# Display the plot
plt.show()
[Figure: count plot titled "Distribution of Severity", with Severity on the x-axis and Count on the y-axis]

Insights from the Plot

  • The count plot provides a clear view of how the data is distributed across the four severity categories.
  • It highlights potential class imbalances, which are crucial to address in subsequent modeling steps, particularly for classification problems.
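
The imbalance can also be expressed numerically. The sketch below recomputes class shares from the cleaned Severity counts printed earlier (3047, 2723, 2527, and 1693); the variable names are chosen for illustration.

```python
import pandas as pd

# Cleaned class counts, copied from the value_counts() output after fuzzy matching
counts = pd.Series({
    "Highly_Fatal_And_Damaging": 3047,
    "Significant_Damage_And_Serious_Injuries": 2723,
    "Minor_Damage_And_Injuries": 2527,
    "Significant_Damage_And_Fatalities": 1693,
})

shares = (counts / counts.sum()).round(3)          # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()      # largest class vs. smallest class

print(shares)
print(round(imbalance_ratio, 2))
```

The largest class is roughly 1.8 times the size of the smallest, which is a moderate imbalance. On the actual data, df_train['Severity'].value_counts(normalize=True) gives the same shares directly.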

Distributions of Variables¶

In [105]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define the number of variables and create subplots
num_vars = num_df.columns
num_plots = len(num_vars)
rows = (num_plots + 2) // 3  # Arrange in a grid with 3 columns per row
fig, axes = plt.subplots(rows, 3, figsize=(15, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten the 2D array of axes for easier indexing

# Set a consistent theme
sns.set_theme(style="whitegrid")

# Plot each variable in its respective subplot
for i, var in enumerate(num_vars):
    sns.histplot(num_df[var], kde=True, color="skyblue", ax=axes[i])  # Use histplot with KDE
    axes[i].set_title(f"Distribution of {var}", fontsize=14, pad=10)
    axes[i].set_xlabel(var, fontsize=12)
    axes[i].set_ylabel("Frequency", fontsize=12)

# Hide any unused subplots
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to remove unwanted spaces
plt.tight_layout()

# Show the plot
plt.show()
[Figure: histograms with KDE overlays for each numerical variable, before transformation]

Handling Skewness¶

Skewed distributions can adversely affect machine learning models by violating the assumptions of normality in some algorithms. To address this, specific transformations are applied to normalize the data.

Initial Visualization: Histograms with KDE

  • Each numerical variable is plotted using sns.histplot() with KDE (Kernel Density Estimate) overlays to observe the data distribution.
  • Left Skew: Variables like Control_Metric show a distribution with a longer tail on the left.
  • Right Skew: Variables such as Cabin_Temperature, Total_Safety_Complaints, Adverse_Weather_Metric, and Turbulence_In_gforces have distributions with a longer tail on the right.
In [106]:
#Fixing the right skew
num_df['Total_Safety_Complaints'] = np.log(num_df['Total_Safety_Complaints']+1) #+1 because this column contains zeros and log(0) is undefined
num_df['Adverse_Weather_Metric'] = np.log(num_df['Adverse_Weather_Metric'])
num_df['Cabin_Temperature'] = np.log(num_df['Cabin_Temperature'])
num_df['Turbulence_In_gforces'] = np.log(num_df['Turbulence_In_gforces'])

#Fixing left skew
num_df['Control_Metric'] = np.power(num_df['Control_Metric'], 2)

Transformation of Skewed Variables

To normalize the distributions:

Right-Skewed Variables:

Log transformations are applied using np.log(), which compresses the right tail and spreads out values near zero. Variables transformed:

  • Total_Safety_Complaints (added +1 to avoid logarithm of zero).
  • Adverse_Weather_Metric
  • Cabin_Temperature
  • Turbulence_In_gforces

Left-Skewed Variable:

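A power transformation is applied to Control_Metric by squaring the values (np.power(x, 2)) to reduce the left skew. Squaring preserves the ordering of values only when they are non-negative; since the summary statistics above show a minimum of -86 for Control_Metric, such negative entries are mapped to large positive values and are worth checking before this step.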
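
Skewness can also be measured rather than judged by eye. The sketch below uses pandas' skew() on synthetic data to compare a right-skewed variable before and after a log transform. The lognormal sample and the column names are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (assumption: stands in for a skewed feature)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

# log1p(x) computes log(1 + x), the same "+1" trick used for Total_Safety_Complaints
toy["log_transformed"] = np.log1p(toy["right_skewed"])

# Values near 0 indicate a roughly symmetric distribution;
# positive values indicate a longer right tail
skewness = toy.skew()
print(skewness)
```

Running num_df.skew() before and after the transformation cell gives the same kind of comparison for the real variables.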

In [107]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define the number of variables and create subplots
num_vars = num_df.columns
num_plots = len(num_vars)
rows = (num_plots + 2) // 3  # Arrange in a grid with 3 columns per row
fig, axes = plt.subplots(rows, 3, figsize=(15, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten the 2D array of axes for easier indexing

# Set a consistent theme
sns.set_theme(style="whitegrid")

# Plot each variable in its respective subplot
for i, var in enumerate(num_vars):
    sns.histplot(num_df[var], kde=True, color="skyblue", ax=axes[i])  # Use histplot with KDE
    axes[i].set_title(f"Distribution of {var}", fontsize=14, pad=10)
    axes[i].set_xlabel(var, fontsize=12)
    axes[i].set_ylabel("Frequency", fontsize=12)

# Hide any unused subplots
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to remove unwanted spaces
plt.tight_layout()

# Show the plot
plt.show()
[Figure: histograms with KDE overlays for each numerical variable, after transformation]

Post-Transformation Visualization

The histograms are re-plotted after the transformations:

  • Variables previously skewed to the right exhibit more symmetrical distributions post log transformation.
  • Control_Metric no longer shows left skewness after the power transformation.

Advantages of Transformation

  • Improved Model Performance:
    • More symmetric distributions can reduce the influence of extreme values when a model is fitted.
    • Algorithms that are sensitive to the shape of the feature distributions tend to benefit most from the transformed data.
  • Enhanced Interpretability:
    • Correcting skewness makes summary statistics such as the mean and standard deviation more representative of the data.

Feature Correlations¶

In [108]:
# Calculate the correlation matrix
correlation = num_df.corr()

# Create a heatmap with better styling
plt.figure(figsize=(12, 8))  # Adjust the figure size
sns.set_theme(style="white")  # Use a clean white background theme

# Create the heatmap
heatmap = sns.heatmap(
    correlation,
    annot=True,  # Annotate each cell with the correlation value
    fmt=".2f",  # Format the numbers to 2 decimal places
    cmap="coolwarm",  # Use a color palette
    vmin=-1, vmax=1,  # Ensure the color range is consistent
    linewidths=0.5,  # Add thin lines between cells
    annot_kws={"size": 10, "color": "black"}  # Customize annotations
)

# Add title and labels
plt.title("Correlation Heatmap of Numerical Variables", fontsize=16, fontweight='bold', pad=15)
plt.xticks(fontsize=10, rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.yticks(fontsize=10, rotation=0)  # Keep y-axis labels horizontal

# Remove extra spaces and display
plt.tight_layout()
plt.show()
[Figure: correlation heatmap of the numerical variables]

No pair of variables shows a worrying level of correlation except Control_Metric and Turbulence_In_gforces. Their correlation of about 0.6 in magnitude is moderate rather than extreme, and keeping both in the model yielded better results, so we decided to keep them.
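
Reading correlated pairs off a heatmap gets harder as the number of variables grows. The sketch below lists every pair whose absolute Pearson correlation reaches a threshold. The function name correlated_pairs, the 0.5 threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.5):
    """List column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Comparisons with NaN are False, so masked cells never pass the filter
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


# Toy data: y closely follows x, z is only weakly related to both
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10.5],
    "z": [5, 1, 4, 2, 3],
})
print(correlated_pairs(toy, threshold=0.5))
```

Calling correlated_pairs(num_df) on the transformed variables would list the pairs to review before deciding which features to keep.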

In [109]:
#Let's put the entire dataset back together
Rem= df_train[['Days_Since_Inspection','Violations','Accident_Type_Code','Severity']]
train2= pd.concat([num_df, Rem], axis=1)
train2.head()
Out[109]:
Safety_Score Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Adverse_Weather_Metric Total_Safety_Complaints Days_Since_Inspection Violations Accident_Type_Code Severity
0 49.223744 5081.597362 -1.301521 4.357222 31335.47682 -0.857192 3.135494 14 3 2 1
1 62.465753 5225.563379 -0.858166 4.437225 26024.71106 -1.043130 3.332205 10 2 2 1
2 63.059361 4404.022241 -1.131328 4.367674 39269.05393 -5.694652 2.833213 13 3 7 2
3 48.082192 5580.648392 -1.087586 4.404155 42771.49920 -1.552452 2.302585 11 1 3 3
4 26.484018 2299.101968 -0.614077 4.345881 35509.22852 -1.732265 3.258097 13 2 3 2

Bivariate analysis¶

  1. Compares more than one attribute in a single graph.
  2. Visualizes the relationships between attributes graphically.
  3. Uncovers hidden patterns and relationships between the attributes.

Here we list which attributes are categorical and which are quantitative:

Categorical:

  1. Violations.

  2. Accident_Type_Code.

  3. Days_Since_Inspection.

  4. Target Variable: Severity

Quantitative:

  1. Adverse_Weather_Metric

  2. Max_Elevation

  3. Cabin_Temperature

  4. Turbulence_In_gforces

  5. Control_Metric

  6. Total_Safety_Complaints

  7. Safety_Score

Compare both Categorical and Quantitative attributes together.¶
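The subsections that follow repeat one three-panel layout for each quantitative attribute. As a sketch of how that pattern could be factored out, the hypothetical helper `plot_against_categoricals` below draws one quantitative column against the three categorical columns, coloured by Severity. The function name, its parameters, and the toy DataFrame are assumptions for illustration and are not part of the original notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

CATEGORICAL = ["Violations", "Accident_Type_Code", "Days_Since_Inspection"]

def plot_against_categoricals(data, y, kind="box", palette="coolwarm"):
    """Draw `y` against each categorical column, split by Severity."""
    plot_fn = sns.boxplot if kind == "box" else sns.lineplot
    fig, axes = plt.subplots(len(CATEGORICAL), 1, figsize=(14, 12))
    y_label = y.replace("_", " ")
    for ax, x in zip(axes, CATEGORICAL):
        plot_fn(ax=ax, x=x, y=y, data=data, hue="Severity", palette=palette)
        x_label = x.replace("_", " ")
        ax.set_title(f"{x_label} vs {y_label}", fontsize=14, weight="bold")
        ax.set_xlabel(x_label, fontsize=12)
        ax.set_ylabel(y_label, fontsize=12)
        ax.legend(title="Severity", bbox_to_anchor=(1.05, 1), loc="upper left")
    fig.tight_layout()
    fig.subplots_adjust(right=0.85)  # leave room for the legends
    return fig

# Toy data standing in for train2 (random values, not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Violations": rng.integers(0, 4, 200),
    "Accident_Type_Code": rng.integers(1, 8, 200),
    "Days_Since_Inspection": rng.integers(8, 20, 200),
    "Severity": rng.integers(0, 4, 200),
    "Safety_Score": rng.uniform(0, 100, 200),
})
fig = plot_against_categoricals(toy, "Safety_Score", kind="box")
```

With such a helper, each subsection would reduce to two calls, one with `kind="box"` and one with `kind="line"`. Axis labels that carry units, such as "Max Elevation (ft)", would need an extra label argument.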

1. Adverse_Weather_Metric¶

In [110]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette for better aesthetics
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Adverse Weather Metric
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Adverse Weather Metric
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Adverse Weather Metric
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for improved alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [111]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette for better aesthetics
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Adverse Weather Metric
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Adverse Weather Metric
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Adverse Weather Metric
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for improved alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

2. Max_Elevation¶

In [112]:
fig, axes = plt.subplots(3, 1, figsize=(14,12))

# Custom color palette
custom_palette = sns.color_palette("Spectral")

# Plot 1: Violations vs Max Elevation
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Max Elevation', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Max Elevation
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Max Elevation', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Max Elevation
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Max Elevation', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjust layout for better alignment
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [113]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette for professional visuals
custom_palette = sns.color_palette("husl")

# Plot 1: Violations vs Max Elevation
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Max Elevation', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Max Elevation
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Max Elevation', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Max Elevation
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Max Elevation', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjust layout for better spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

3. Cabin_Temperature¶

In [114]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("Spectral")

# Plot 1: Violations vs Cabin Temperature
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Cabin Temperature', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Cabin Temperature
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Cabin Temperature', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Cabin Temperature
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Cabin Temperature', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [115]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Cabin Temperature
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Cabin Temperature', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Cabin Temperature
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Cabin Temperature', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Cabin Temperature
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Cabin Temperature', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

4. Turbulence_In_gforces¶

In [116]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("Set2")

# Plot 1: Violations vs Turbulence In g-forces
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Turbulence In g-forces
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Turbulence In g-forces
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [117]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Turbulence In g-forces
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Turbulence In g-forces
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Turbulence In g-forces
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

5. Control_Metric¶

In [118]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("Spectral")

# Plot 1: Violations vs Control Metric
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Control Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Control Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Control Metric
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Control Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Control Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Control Metric
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Control Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Control Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [119]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Control Metric
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Control Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Control Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Control Metric
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Control Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Control Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Control Metric
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Control Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Control Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

6. Total_Safety_Complaints¶

In [120]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Total Safety Complaints
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Total Safety Complaints', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Total Safety Complaints', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Total Safety Complaints
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Total Safety Complaints', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Total Safety Complaints', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Total Safety Complaints
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Total Safety Complaints', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Total Safety Complaints', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [121]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Total Safety Complaints
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Total Safety Complaints', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Total Safety Complaints', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Total Safety Complaints
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Total Safety Complaints', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Total Safety Complaints', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Total Safety Complaints
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Total Safety Complaints', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Total Safety Complaints', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

7. Safety_Score¶

In [122]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Safety Score
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Safety Score', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Safety Score', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Safety Score
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Safety Score', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Safety Score', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Safety Score
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Safety Score', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Safety Score', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [123]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Safety Score
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Safety Score', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Safety Score', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Safety Score
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Safety Score', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Safety Score', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Safety Score
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Safety Score', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Safety Score', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [124]:
train2 = train2[train2.columns.drop('Severity')]

Feature Scaling¶

In [125]:
from sklearn import preprocessing
scaler= preprocessing.StandardScaler()
scaled_df= scaler.fit_transform(train2)
scaled_df= pd.DataFrame(scaled_df, columns= ['Safety_Score', 'Control_Metric', 'Turbulence_In_gforces', 'Cabin_Temperature', 'Max_Elevation', 'Adverse_Weather_Metric', 'Total_Safety_Complaints', 'Days_Since_Inspection', 'Violations','Accident_Type_Code'])
scaled_df.head()
Out[125]:
Safety_Score Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Adverse_Weather_Metric Total_Safety_Complaints Days_Since_Inspection Violations Accident_Type_Code
0 0.453367 0.454535 -0.921634 -0.699871 -0.070377 0.957899 1.610763 0.301598 0.949701 -0.953985
1 1.258588 0.548247 0.492892 1.651807 -0.633557 0.861473 1.819583 -0.828441 -0.011644 -0.953985
2 1.294685 0.013479 -0.378633 -0.392618 0.770939 -1.550767 1.289873 0.019089 0.949701 1.674775
3 0.383952 0.779384 -0.239072 0.679728 1.142355 0.597343 0.726578 -0.545931 -0.972989 -0.428233
4 -0.929391 -1.356682 1.271662 -1.033217 0.372228 0.504093 1.740913 0.019089 -0.011644 -0.428233
  • The StandardScaler from sklearn.preprocessing is used for standardization.
  • Standardization transforms each feature to have a mean of 0 and a standard deviation of 1, ensuring a common scale without distorting relative relationships.
  • scaled_df.head() outputs the first 5 rows of the standardized data.
  • This allows verification that scaling was applied correctly and data integrity was maintained.
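To make the standardization step concrete, the sketch below computes the z-score by hand, z = (x - mean) / SD, and compares it with the output of `StandardScaler`. The small matrix `X` is toy data chosen for illustration, not the real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (not the real dataset): 5 rows, 2 features on very different scales
X = np.array([[10.0, 1000.0],
              [20.0, 3000.0],
              [30.0, 5000.0],
              [40.0, 7000.0],
              [50.0, 9000.0]])

# Manual z-score: subtract the column mean, divide by the population SD (ddof=0)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same formula column by column
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```

Both columns end up on the same scale even though their raw ranges differ by two orders of magnitude, which is the property the bullets above describe.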
In [126]:
# Check the mean (should be approximately 0) and the SD (ideally 1) of the scaled DataFrame
scaled_df.mean()
Out[126]:
0
Safety_Score -1.024206e-16
Control_Metric -2.091087e-16
Turbulence_In_gforces 5.469543e-16
Cabin_Temperature 1.974441e-14
Max_Elevation -5.049903e-17
Adverse_Weather_Metric -2.973042e-16
Total_Safety_Complaints 2.236894e-16
Days_Since_Inspection 2.489389e-16
Violations -3.200643e-17
Accident_Type_Code 2.845016e-18

In [127]:
scaled_df.std()
Out[127]:
0
Safety_Score 1.00005
Control_Metric 1.00005
Turbulence_In_gforces 1.00005
Cabin_Temperature 1.00005
Max_Elevation 1.00005
Adverse_Weather_Metric 1.00005
Total_Safety_Complaints 1.00005
Days_Since_Inspection 1.00005
Violations 1.00005
Accident_Type_Code 1.00005
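Every column reports 1.00005 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (ddof=0), while pandas `.std()` defaults to the sample standard deviation (ddof=1). The ratio between the two is sqrt(n / (n - 1)), which is about 1.00005 for n = 10000. The sketch below reproduces this on toy data; the row count of 10000 is inferred from the printed value and is an assumption, as is the random column `x`.

```python
import numpy as np
import pandas as pd

n = 10_000  # assumed row count, inferred from the 1.00005 shown above
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=12.0, size=n)  # toy column, not the real dataset

# Standardize with the population SD (ddof=0), as StandardScaler does
z = pd.Series((x - x.mean()) / x.std(ddof=0))

sample_sd = z.std()            # pandas default: ddof=1
population_sd = z.std(ddof=0)  # matches the scaler's own definition
expected_ratio = np.sqrt(n / (n - 1))

print(round(sample_sd, 5), round(population_sd, 5), round(expected_ratio, 5))
```

Calling `scaled_df.std(ddof=0)` in the notebook should therefore report values of 1 up to floating-point error.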

In [128]:
#Let's check the distribution of Variables now
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 15))

# Columns to compare before and after scaling
kde_cols = ['Safety_Score', 'Days_Since_Inspection', 'Total_Safety_Complaints',
            'Control_Metric', 'Turbulence_In_gforces', 'Cabin_Temperature',
            'Max_Elevation', 'Violations', 'Adverse_Weather_Metric']

ax1.set_title('Before Scaling')
for col in kde_cols:
    sns.kdeplot(df_train[col], ax=ax1)

ax2.set_title('After Standard Scaler')
for col in kde_cols:
    sns.kdeplot(scaled_df[col], ax=ax2)

plt.show()
In [129]:
# Define the number of variables and create subplots
num_vars = testing2.columns
num_plots = len(num_vars)
rows = (num_plots + 1) // 2  # Arrange in a grid with 2 plots per row
fig, axes = plt.subplots(rows, 2, figsize=(14, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten the 2D array of axes for easier indexing

# Plot each variable in its respective subplot
for i, var in enumerate(num_vars):
    sns.histplot(testing2[var], kde=True, color="dodgerblue", ax=axes[i])  # Use histplot with KDE
    axes[i].set_title(f"Distribution of {var}", fontsize=14, pad=10)
    axes[i].set_xlabel(var, fontsize=12)
    axes[i].set_ylabel("Frequency", fontsize=12)

# Hide any unused subplots (if the number of variables is odd)
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to remove unwanted spaces
plt.tight_layout()

# Show the plot
plt.show()

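The checks above confirm that each scaled training feature now has a mean of approximately 0 and a standard deviation of approximately 1. Standardization only shifts and rescales a feature, so it does not by itself make a distribution normal. Let's apply the same transformations to the test data before proceeding with model fitting.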
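One way to guarantee that the training and test sets receive identical transformations is to route both through a single function. The sketch below assumes a hypothetical helper `apply_skew_fixes` that mirrors the expressions used in the next cell; the toy rows are illustrative values, not the real dataset. Since log(x + 1) cannot be negative for x >= 0, and `train2.head()` above shows negative values for Turbulence_In_gforces and Adverse_Weather_Metric, it may be worth confirming that the training cell used the same expressions.

```python
import numpy as np
import pandas as pd

LOG1P_COLS = ["Total_Safety_Complaints", "Adverse_Weather_Metric",
              "Cabin_Temperature", "Turbulence_In_gforces"]
SQUARE_COLS = ["Control_Metric"]

def apply_skew_fixes(df):
    """Return a copy of df with log(x + 1) and squaring applied to the listed columns."""
    out = df.copy()
    for col in LOG1P_COLS:
        out[col] = np.log1p(out[col])     # same as np.log(x + 1)
    for col in SQUARE_COLS:
        out[col] = np.power(out[col], 2)  # squaring counteracts left skew
    return out

# Toy rows (not the real dataset) to show that both frames get identical treatment
toy_train = pd.DataFrame({
    "Total_Safety_Complaints": [0.0, 6.0],
    "Adverse_Weather_Metric": [0.1, 0.5],
    "Cabin_Temperature": [78.0, 80.0],
    "Turbulence_In_gforces": [0.3, 0.4],
    "Control_Metric": [70.0, 72.0],
})
toy_test = toy_train.iloc[[1]].reset_index(drop=True)

train_t = apply_skew_fixes(toy_train)
test_t = apply_skew_fixes(toy_test)
```

Because the function returns a copy, the raw frames stay available for later inspection.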

In [130]:
#Applying transformations
testing2['Total_Safety_Complaints'] = np.log(testing2['Total_Safety_Complaints']+1)
testing2['Adverse_Weather_Metric'] = np.log(testing2['Adverse_Weather_Metric']+1)
testing2['Cabin_Temperature'] = np.log(testing2['Cabin_Temperature']+1)
testing2['Turbulence_In_gforces'] = np.log(testing2['Turbulence_In_gforces']+1)

#Fixing left skew
testing2['Control_Metric'] = np.power(testing2['Control_Metric'], 2)
In [131]:
testing2.head()
Out[131]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric Accident_ID
0 19.497717 16 1.945910 5205.813236 0.328554 4.373490 37949.724386 2 0.067371 1
1 58.173516 15 1.386294 4171.252251 0.223816 4.377014 30194.805567 2 0.002774 10
2 33.287671 15 1.386294 4188.933272 0.290180 4.476882 17572.925484 1 0.004307 14
3 3.287671 21 1.791759 4404.022240 0.351906 4.405010 40209.186341 2 0.182314 17
4 10.867580 18 1.098612 3148.058972 0.272488 4.384773 35495.525408 2 0.394536 21
In [132]:
ID_Col= testing2[['Accident_ID']]
testing_df= testing2.drop(['Accident_ID'], axis=1)
testing_df.head()
Out[132]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
0 19.497717 16 1.945910 5205.813236 0.328554 4.373490 37949.724386 2 0.067371
1 58.173516 15 1.386294 4171.252251 0.223816 4.377014 30194.805567 2 0.002774
2 33.287671 15 1.386294 4188.933272 0.290180 4.476882 17572.925484 1 0.004307
3 3.287671 21 1.791759 4404.022240 0.351906 4.405010 40209.186341 2 0.182314
4 10.867580 18 1.098612 3148.058972 0.272488 4.384773 35495.525408 2 0.394536
In [133]:
#Standardization
scaler= preprocessing.StandardScaler()
scaled_df_test= scaler.fit_transform(testing_df)
scaled_df_test= pd.DataFrame(scaled_df_test, columns= ['Safety_Score', 'Days_Since_Inspection','Total_Safety_Complaints', 'Control_Metric', 'Turbulence_In_gforces', 'Cabin_Temperature', 'Max_Elevation', 'Violations', 'Adverse_Weather_Metric'])
scaled_df_test.head()
Out[133]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
0 -1.371727 0.866845 0.355919 0.542584 0.153563 -0.614333 0.586995 0.009034 -0.483246
1 1.004384 0.582969 -0.232222 -0.157369 -1.111490 -0.507807 -0.230758 0.009034 -0.741989
2 -0.524519 0.582969 -0.232222 -0.145406 -0.309925 2.511257 -1.561731 -0.972910 -0.735846
3 -2.367618 2.286227 0.193911 0.000116 0.435613 0.338538 0.825254 0.009034 -0.022850
4 -1.901934 1.434598 -0.534568 -0.849630 -0.523613 -0.273256 0.328201 0.009034 0.827197
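Note that the cell above fits a fresh StandardScaler on the test features, so the test data is standardized with its own mean and variance. A common alternative is to fit the scaler once on the training features and reuse it for the test features. The following is a minimal sketch of that pattern on synthetic data; `train_features` and `test_features` are hypothetical names, and two columns stand in for the nine real ones.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ["Safety_Score", "Max_Elevation"]  # two columns stand in for the nine features

# Synthetic stand-ins for the training and test feature frames
train_features = pd.DataFrame(
    rng.normal([40, 30000], [15, 9000], size=(500, 2)), columns=cols)
test_features = pd.DataFrame(
    rng.normal([40, 30000], [15, 9000], size=(100, 2)), columns=cols)

scaler = StandardScaler()
scaler.fit(train_features)  # statistics come from the training data only

scaled_train = pd.DataFrame(scaler.transform(train_features), columns=cols)
scaled_test = pd.DataFrame(scaler.transform(test_features), columns=cols)

print(scaled_train.mean().round(3))  # approximately 0 by construction
print(scaled_test.mean().round(3))   # close to 0, but generally not exactly 0
```

With this pattern the column order of the test frame has to match the order the scaler was fitted on.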

Model Building¶

Feature Selection¶

In [134]:
# Rearrange train data
train_df= scaled_df[['Safety_Score', 'Days_Since_Inspection','Total_Safety_Complaints', 'Control_Metric', 'Turbulence_In_gforces', 'Cabin_Temperature', 'Max_Elevation', 'Violations', 'Adverse_Weather_Metric']]
train_df.head()
Out[134]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
0 0.453367 0.301598 1.610763 0.454535 -0.921634 -0.699871 -0.070377 0.949701 0.957899
1 1.258588 -0.828441 1.819583 0.548247 0.492892 1.651807 -0.633557 -0.011644 0.861473
2 1.294685 0.019089 1.289873 0.013479 -0.378633 -0.392618 0.770939 0.949701 -1.550767
3 0.383952 -0.545931 0.726578 0.779384 -0.239072 0.679728 1.142355 -0.972989 0.597343
4 -0.929391 0.019089 1.740913 -1.356682 1.271662 -1.033217 0.372228 -0.011644 0.504093
In [135]:
# Check if it's same as original scaled dataframe
scaled_df.head()
Out[135]:
Safety_Score Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Adverse_Weather_Metric Total_Safety_Complaints Days_Since_Inspection Violations Accident_Type_Code
0 0.453367 0.454535 -0.921634 -0.699871 -0.070377 0.957899 1.610763 0.301598 0.949701 -0.953985
1 1.258588 0.548247 0.492892 1.651807 -0.633557 0.861473 1.819583 -0.828441 -0.011644 -0.953985
2 1.294685 0.013479 -0.378633 -0.392618 0.770939 -1.550767 1.289873 0.019089 0.949701 1.674775
3 0.383952 0.779384 -0.239072 0.679728 1.142355 0.597343 0.726578 -0.545931 -0.972989 -0.428233
4 -0.929391 -1.356682 1.271662 -1.033217 0.372228 0.504093 1.740913 0.019089 -0.011644 -0.428233
In [136]:
#Put into X and y arrays
X= train_df
y= df_train['Severity']

Train-Test Split¶

In [137]:
#Split into train and validation sets
X_train, X_Val, y_train, y_Val= train_test_split(X, y, test_size=0.2, random_state=20)
print("shape of training data:", X_train.shape, "\nShape of Validation data:", X_Val.shape, "\nShape of training label:", y_train.shape, "\nShape of Validation label:", y_Val.shape)
shape of training data: (7992, 9) 
Shape of Validation data: (1998, 9) 
Shape of training label: (7992,) 
Shape of Validation label: (1998,)
  • Proportion: The dataset is successfully split into 80% training and 20% validation subsets.
  • Consistency: Shapes of features and labels align correctly between training and validation sets.
  • Next Steps: This split allows the model to be trained on X_train and y_train and evaluated on X_Val and y_Val.
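The split above does not pass `stratify`, so class proportions in the two subsets can drift slightly from each other. If that matters, `train_test_split` accepts `stratify=y`. The sketch below shows the idea on synthetic data; the `_demo` names and the class probabilities are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(20)
# Synthetic stand-ins: 1,000 rows, 9 features, imbalanced labels 1..4
X_demo = pd.DataFrame(rng.normal(size=(1000, 9)))
y_demo = pd.Series(rng.choice([1, 2, 3, 4], size=1000, p=[0.3, 0.3, 0.25, 0.15]))

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo
)

# Class shares should be nearly identical in both subsets
train_share = y_tr.value_counts(normalize=True).sort_index()
val_share = y_va.value_counts(normalize=True).sort_index()
print(train_share.round(3))
print(val_share.round(3))
```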
In [138]:
X_train.head()
Out[138]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
113 -0.623963 0.584108 -0.009242 -0.385082 0.980217 0.201459 1.135712 0.949701 -0.957652
6341 -1.234820 0.584108 0.726578 -0.388734 0.191404 2.301726 -0.314512 0.949701 1.435191
104 -0.618409 0.584108 -0.551515 -0.385082 -0.125063 1.178620 -0.877074 -0.011644 1.330455
1698 0.950383 -1.393461 1.289873 1.151961 -0.875460 0.032944 0.892252 -0.011644 -1.005975
2586 -0.526781 0.584108 0.489697 -1.316585 0.331465 0.747933 -1.024185 -0.972989 -1.918657
In [139]:
y_train.head()
Out[139]:
Severity
113 3
6347 4
104 3
1700 3
2588 1
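The index labels in the two previews differ (for example, 6341 in X_train against 6347 in y_train), which suggests that X carries a reset index while y keeps the original labels from df_train. `train_test_split` selects rows by position, so features and labels still correspond as long as X and y were in the same row order to begin with. The toy example below sketches that behaviour; all names and values in it are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: X has a clean 0..5 index, y keeps gapped labels (as if rows had been dropped)
X_toy = pd.DataFrame({"feature": [10, 11, 12, 13, 14, 15]})
y_toy = pd.Series([1, 2, 3, 4, 1, 2], index=[0, 1, 3, 4, 7, 9], name="Severity")

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.33, random_state=0)

# Rows are paired by position: map each X label back to the position it came from
positions = [X_toy.index.get_loc(label) for label in X_tr.index]
assert list(y_toy.iloc[positions]) == list(y_tr)

# Optional: give y the same index as X so later previews line up label for label
y_aligned = y_toy.reset_index(drop=True)
```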

Model Training¶

Baseline Models¶

This experiment involves training and evaluating three machine learning models: Random Forest, XGBoost, and a Neural Network.

In [140]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)  # You can adjust hyperparameters
rf_classifier.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = rf_classifier.predict(X_Val)

# Evaluate the model
accuracy = accuracy_score(y_Val, y_pred)
print(f"Random Forest Accuracy: {accuracy}")

#Now you can use the trained model to predict on the test set
#test_predictions = rf_classifier.predict(scaled_df_test)
Random Forest Accuracy: 0.953953953953954

Random Forest Classifier:

Description:

  • A tree-based ensemble method that combines multiple decision trees to improve performance.
  • n_estimators=100: Uses 100 decision trees.

Accuracy:

  • Validation accuracy was 0.954.

Strengths:

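  • Works with numerical features without scaling, and with categorical features once they are encoded.
  • Averaging over many trees reduces variance, which makes the ensemble less prone to overfitting than a single tree.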

Inference:

  • Achieved good accuracy on the validation set, making it a strong baseline.
  • Random forests handle multi-class targets natively; on this validation set XGBoost scored only marginally higher (0.958 vs 0.954).
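Beyond accuracy, a fitted random forest exposes `feature_importances_`, which can hint at which features the trees rely on. The sketch below uses a synthetic four-class problem from `make_classification` rather than the accident data, and the feature names are placeholders, so the printed ranking says nothing about the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class problem standing in for the accident data
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=9, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(9)]  # placeholder names

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_syn, y_syn)

# Impurity-based importances are normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).round(3))
```

On the real data the same two lines after `fit` could be applied to `rf_classifier` with `X_train.columns` as the index.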
In [141]:
from sklearn.preprocessing import LabelEncoder

# Encode the labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)  # Map severity labels to 0..3
y_Val_encoded = label_encoder.transform(y_Val)          # Use the same encoding

import xgboost as xgb
from sklearn.metrics import accuracy_score

# Initialize and train the XGBoost Classifier
xgb_classifier = xgb.XGBClassifier(objective='multi:softmax', num_class=4, random_state=42)
xgb_classifier.fit(X_train, y_train_encoded)

# Make predictions on the validation set
y_pred_xgb = xgb_classifier.predict(X_Val)

# Decode predictions back to original labels if necessary
y_pred_xgb_decoded = label_encoder.inverse_transform(y_pred_xgb)
y_Val_decoded = label_encoder.inverse_transform(y_Val_encoded)

# Evaluate the model
accuracy_xgb = accuracy_score(y_Val_decoded, y_pred_xgb_decoded)
print(f"XGBoost Accuracy: {accuracy_xgb}")
XGBoost Accuracy: 0.9579579579579579

XGBoost Classifier:

Description:

  • A gradient-boosting algorithm optimized for speed and performance.
  • objective='multi:softmax': Used for multi-class classification.
  • num_class=4: Specifies four target classes.

Label Encoding:

  • Severity labels were mapped to consecutive integers starting at 0 using LabelEncoder.
  • Predictions were decoded back to original labels for evaluation.

Accuracy:

  • Validation accuracy was 0.958.

Strengths:

  • Often achieves superior performance on structured data.
  • Built-in handling of multi-class tasks.

Inference:

  • Scored marginally higher than Random Forest (0.958 vs 0.954), a gap of about eight of the 1,998 validation samples, so the difference may not be meaningful.
  • Well-suited for this problem if computation time is not a concern.
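The label-encoding step can be illustrated in isolation. Recent versions of XGBClassifier expect class labels numbered from 0, which is why the severity codes are encoded before training and decoded afterwards. The sketch below assumes the four codes are 1 to 4 (the y_train preview shows 1, 3 and 4); the array is a toy example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy severity codes, assumed to range over 1..4
severity = np.array([3, 4, 3, 3, 1, 2])

encoder = LabelEncoder()
encoded = encoder.fit_transform(severity)      # classes are sorted, then numbered from 0
decoded = encoder.inverse_transform(encoded)   # back to the original codes

print(encoder.classes_)  # [1 2 3 4]
print(encoded)           # [2 3 2 2 0 1]
print(decoded)           # [3 4 3 3 1 2]
```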

Neural Network¶

In [142]:
import tensorflow as tf
from tensorflow.keras import layers, regularizers
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_Val_scaled = scaler.transform(X_Val)
# Simplified and optimized neural network architecture
model = tf.keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(4, activation='softmax')  # Output layer for multi-class classification
])

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train_scaled, pd.get_dummies(y_train).values,
                    epochs=100,  # Increase epochs
                    batch_size=16,  # Smaller batch size
                    validation_data=(X_Val_scaled, pd.get_dummies(y_Val).values))

# Evaluate the model
loss, accuracy = model.evaluate(X_Val_scaled, pd.get_dummies(y_Val).values)
print(f"Neural Network Accuracy: {accuracy}")
Epoch 1/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 4s 5ms/step - accuracy: 0.3197 - loss: 1.3781 - val_accuracy: 0.4424 - val_loss: 1.2317
Epoch 2/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.5000 - loss: 1.1351 - val_accuracy: 0.6587 - val_loss: 0.8995
Epoch 3/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.6888 - loss: 0.8225 - val_accuracy: 0.7773 - val_loss: 0.6658
Epoch 4/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8091 - loss: 0.5935 - val_accuracy: 0.8323 - val_loss: 0.5331
Epoch 5/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8371 - loss: 0.4916 - val_accuracy: 0.8684 - val_loss: 0.4566
Epoch 6/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8772 - loss: 0.4043 - val_accuracy: 0.8939 - val_loss: 0.4071
Epoch 7/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8912 - loss: 0.3861 - val_accuracy: 0.9099 - val_loss: 0.3790
Epoch 8/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9134 - loss: 0.3297 - val_accuracy: 0.9174 - val_loss: 0.3564
Epoch 9/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9178 - loss: 0.3358 - val_accuracy: 0.9189 - val_loss: 0.3389
Epoch 10/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9300 - loss: 0.2765 - val_accuracy: 0.9189 - val_loss: 0.3281
Epoch 11/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9283 - loss: 0.2780 - val_accuracy: 0.9239 - val_loss: 0.3224
Epoch 12/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9378 - loss: 0.2533 - val_accuracy: 0.9229 - val_loss: 0.3098
Epoch 13/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9264 - loss: 0.2684 - val_accuracy: 0.9289 - val_loss: 0.2954
Epoch 14/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9320 - loss: 0.2565 - val_accuracy: 0.9304 - val_loss: 0.2907
Epoch 15/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9272 - loss: 0.2668 - val_accuracy: 0.9279 - val_loss: 0.2891
Epoch 16/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9282 - loss: 0.2530 - val_accuracy: 0.9319 - val_loss: 0.2846
Epoch 17/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9332 - loss: 0.2633 - val_accuracy: 0.9259 - val_loss: 0.2787
Epoch 18/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9353 - loss: 0.2458 - val_accuracy: 0.9319 - val_loss: 0.2779
Epoch 19/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9390 - loss: 0.2110 - val_accuracy: 0.9324 - val_loss: 0.2732
Epoch 20/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9343 - loss: 0.2428 - val_accuracy: 0.9324 - val_loss: 0.2713
Epoch 21/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9410 - loss: 0.2080 - val_accuracy: 0.9304 - val_loss: 0.2653
Epoch 22/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9377 - loss: 0.2139 - val_accuracy: 0.9239 - val_loss: 0.2671
Epoch 23/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9393 - loss: 0.2231 - val_accuracy: 0.9294 - val_loss: 0.2583
Epoch 24/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9406 - loss: 0.2184 - val_accuracy: 0.9344 - val_loss: 0.2583
Epoch 25/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9389 - loss: 0.2216 - val_accuracy: 0.9319 - val_loss: 0.2553
Epoch 26/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9390 - loss: 0.2361 - val_accuracy: 0.9344 - val_loss: 0.2561
Epoch 27/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9397 - loss: 0.2097 - val_accuracy: 0.9359 - val_loss: 0.2543
Epoch 28/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9364 - loss: 0.2079 - val_accuracy: 0.9329 - val_loss: 0.2527
Epoch 29/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9369 - loss: 0.2051 - val_accuracy: 0.9244 - val_loss: 0.2522
Epoch 30/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9387 - loss: 0.2070 - val_accuracy: 0.9314 - val_loss: 0.2499
Epoch 31/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9375 - loss: 0.2042 - val_accuracy: 0.9309 - val_loss: 0.2513
Epoch 32/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9367 - loss: 0.2250 - val_accuracy: 0.9324 - val_loss: 0.2486
Epoch 33/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9429 - loss: 0.2054 - val_accuracy: 0.9314 - val_loss: 0.2540
Epoch 34/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9412 - loss: 0.1993 - val_accuracy: 0.9334 - val_loss: 0.2484
Epoch 35/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9333 - loss: 0.2183 - val_accuracy: 0.9349 - val_loss: 0.2465
Epoch 36/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9400 - loss: 0.2089 - val_accuracy: 0.9329 - val_loss: 0.2440
Epoch 37/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9384 - loss: 0.2006 - val_accuracy: 0.9304 - val_loss: 0.2487
Epoch 38/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9389 - loss: 0.2099 - val_accuracy: 0.9314 - val_loss: 0.2500
Epoch 39/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9390 - loss: 0.2074 - val_accuracy: 0.9339 - val_loss: 0.2475
Epoch 40/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9407 - loss: 0.1927 - val_accuracy: 0.9349 - val_loss: 0.2437
Epoch 41/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9390 - loss: 0.2062 - val_accuracy: 0.9324 - val_loss: 0.2426
Epoch 42/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9409 - loss: 0.1999 - val_accuracy: 0.9274 - val_loss: 0.2426
Epoch 43/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9423 - loss: 0.1883 - val_accuracy: 0.9314 - val_loss: 0.2402
Epoch 44/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9359 - loss: 0.2077 - val_accuracy: 0.9284 - val_loss: 0.2453
Epoch 45/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9403 - loss: 0.1879 - val_accuracy: 0.9329 - val_loss: 0.2419
Epoch 46/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9463 - loss: 0.1756 - val_accuracy: 0.9309 - val_loss: 0.2440
Epoch 47/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9450 - loss: 0.1880 - val_accuracy: 0.9324 - val_loss: 0.2387
Epoch 48/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9436 - loss: 0.1890 - val_accuracy: 0.9269 - val_loss: 0.2411
Epoch 49/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9396 - loss: 0.1950 - val_accuracy: 0.9334 - val_loss: 0.2429
Epoch 50/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9341 - loss: 0.2130 - val_accuracy: 0.9274 - val_loss: 0.2391
Epoch 51/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9426 - loss: 0.1847 - val_accuracy: 0.9339 - val_loss: 0.2382
Epoch 52/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9424 - loss: 0.1970 - val_accuracy: 0.9334 - val_loss: 0.2371
Epoch 53/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9440 - loss: 0.1832 - val_accuracy: 0.9309 - val_loss: 0.2356
Epoch 54/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9412 - loss: 0.1890 - val_accuracy: 0.9279 - val_loss: 0.2406
Epoch 55/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9361 - loss: 0.2022 - val_accuracy: 0.9299 - val_loss: 0.2365
Epoch 56/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9404 - loss: 0.2006 - val_accuracy: 0.9309 - val_loss: 0.2360
Epoch 57/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9440 - loss: 0.1910 - val_accuracy: 0.9289 - val_loss: 0.2364
Epoch 58/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9404 - loss: 0.1854 - val_accuracy: 0.9314 - val_loss: 0.2335
Epoch 59/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9496 - loss: 0.1810 - val_accuracy: 0.9234 - val_loss: 0.2437
Epoch 60/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9378 - loss: 0.1824 - val_accuracy: 0.9309 - val_loss: 0.2334
Epoch 61/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9400 - loss: 0.1891 - val_accuracy: 0.9279 - val_loss: 0.2331
Epoch 62/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9393 - loss: 0.1884 - val_accuracy: 0.9349 - val_loss: 0.2285
Epoch 63/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9430 - loss: 0.1812 - val_accuracy: 0.9299 - val_loss: 0.2272
Epoch 64/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9433 - loss: 0.1833 - val_accuracy: 0.9309 - val_loss: 0.2264
Epoch 65/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9444 - loss: 0.1788 - val_accuracy: 0.9314 - val_loss: 0.2337
Epoch 66/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9347 - loss: 0.2009 - val_accuracy: 0.9319 - val_loss: 0.2297
Epoch 67/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9399 - loss: 0.1901 - val_accuracy: 0.9309 - val_loss: 0.2314
Epoch 68/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9398 - loss: 0.2011 - val_accuracy: 0.9334 - val_loss: 0.2290
Epoch 69/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9369 - loss: 0.1884 - val_accuracy: 0.9309 - val_loss: 0.2330
Epoch 70/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9435 - loss: 0.1776 - val_accuracy: 0.9284 - val_loss: 0.2310
Epoch 71/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9446 - loss: 0.1846 - val_accuracy: 0.9299 - val_loss: 0.2282
Epoch 72/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9368 - loss: 0.2004 - val_accuracy: 0.9289 - val_loss: 0.2289
Epoch 73/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9466 - loss: 0.1722 - val_accuracy: 0.9309 - val_loss: 0.2291
Epoch 74/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9451 - loss: 0.1781 - val_accuracy: 0.9344 - val_loss: 0.2205
Epoch 75/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.9390 - loss: 0.2019 - val_accuracy: 0.9324 - val_loss: 0.2263
Epoch 76/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9415 - loss: 0.1933 - val_accuracy: 0.9284 - val_loss: 0.2228
Epoch 77/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9449 - loss: 0.1750 - val_accuracy: 0.9329 - val_loss: 0.2278
Epoch 78/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9442 - loss: 0.1745 - val_accuracy: 0.9309 - val_loss: 0.2211
Epoch 79/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9407 - loss: 0.1909 - val_accuracy: 0.9349 - val_loss: 0.2173
Epoch 80/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9457 - loss: 0.1721 - val_accuracy: 0.9284 - val_loss: 0.2222
Epoch 81/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9443 - loss: 0.1829 - val_accuracy: 0.9319 - val_loss: 0.2192
Epoch 82/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9437 - loss: 0.1843 - val_accuracy: 0.9354 - val_loss: 0.2133
Epoch 83/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9457 - loss: 0.1715 - val_accuracy: 0.9304 - val_loss: 0.2156
Epoch 84/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9405 - loss: 0.1826 - val_accuracy: 0.9349 - val_loss: 0.2110
Epoch 85/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9427 - loss: 0.1863 - val_accuracy: 0.9329 - val_loss: 0.2163
Epoch 86/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9414 - loss: 0.1861 - val_accuracy: 0.9309 - val_loss: 0.2183
Epoch 87/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9399 - loss: 0.1864 - val_accuracy: 0.9264 - val_loss: 0.2153
Epoch 88/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9442 - loss: 0.1694 - val_accuracy: 0.9349 - val_loss: 0.2092
Epoch 89/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9425 - loss: 0.1733 - val_accuracy: 0.9299 - val_loss: 0.2162
Epoch 90/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9441 - loss: 0.1774 - val_accuracy: 0.9304 - val_loss: 0.2179
Epoch 91/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9453 - loss: 0.1694 - val_accuracy: 0.9334 - val_loss: 0.2157
Epoch 92/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9444 - loss: 0.1711 - val_accuracy: 0.9299 - val_loss: 0.2163
Epoch 93/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9460 - loss: 0.1647 - val_accuracy: 0.9364 - val_loss: 0.2085
Epoch 94/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9479 - loss: 0.1632 - val_accuracy: 0.9309 - val_loss: 0.2180
Epoch 95/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9432 - loss: 0.1743 - val_accuracy: 0.9324 - val_loss: 0.2178
Epoch 96/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9456 - loss: 0.1681 - val_accuracy: 0.9339 - val_loss: 0.2129
Epoch 97/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9428 - loss: 0.1755 - val_accuracy: 0.9284 - val_loss: 0.2186
Epoch 98/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9458 - loss: 0.1682 - val_accuracy: 0.9334 - val_loss: 0.2123
Epoch 99/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9447 - loss: 0.1737 - val_accuracy: 0.9319 - val_loss: 0.2104
Epoch 100/100
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9436 - loss: 0.1703 - val_accuracy: 0.9309 - val_loss: 0.2081
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9382 - loss: 0.2013
Neural Network Accuracy: 0.9309309124946594
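The training cell one-hot encodes y_train and y_Val separately with `pd.get_dummies`. That works here because both subsets contain all four classes, but if a class were missing from one subset the encoded array would have fewer columns than the 4-unit softmax layer expects. A possible safeguard, sketched below with a hypothetical `one_hot` helper and assumed severity codes 1 to 4, is to fix the category list before encoding.

```python
import pandas as pd

CLASSES = [1, 2, 3, 4]  # assumed severity codes

def one_hot(labels: pd.Series) -> pd.DataFrame:
    """One-hot encode labels with one column per class, whether present or not."""
    fixed = labels.astype(pd.CategoricalDtype(categories=CLASSES))
    return pd.get_dummies(fixed, dtype="float32")

y_small_val = pd.Series([3, 4, 3, 1])       # toy validation labels with class 2 missing
print(pd.get_dummies(y_small_val).shape)    # (4, 3): one column short
print(one_hot(y_small_val).shape)           # (4, 4): matches the 4-unit softmax layer
```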

Neural Network:

Description:

  • A feedforward neural network with: Input Layer → 16 neurons → 8 neurons → 4 neurons (output layer for 4 classes with softmax activation).
  • Optimized with the Adam optimizer and categorical_crossentropy for multi-class classification.

Data Scaling:

  • Features were passed through StandardScaler again (they had already been standardized earlier), which keeps the inputs on a comparable scale for gradient-based training.

Training:

  • Used 100 epochs and a batch size of 16.
  • Outputs validation accuracy during training.

Accuracy:

  • Final validation accuracy was 0.931.

Strengths:

  • Can capture non-linear relationships.
  • Flexible architecture allows customization.

Inference:

  • Neural networks may take longer to train and are prone to overfitting on small datasets.
  • Achieved competitive accuracy (0.931) but did not match XGBoost (0.958) or Random Forest (0.954) on this structured data.
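To put the three validation accuracies reported above in context, a rough uncertainty estimate can be attached to each. The sketch below uses the binomial standard error of a proportion with the 1,998 validation rows. It is only a rough guide: it treats each model independently, whereas all three were scored on the same validation set, so a paired comparison would be more appropriate for a formal claim.

```python
import math

N_VAL = 1998  # validation rows, from the train/validation split
accuracies = {
    "Random Forest": 0.953953953953954,
    "XGBoost": 0.9579579579579579,
    "Neural Network": 0.9309309124946594,
}

for name, acc in accuracies.items():
    correct = round(acc * N_VAL)
    # Binomial standard error of a proportion; 1.96 gives an approximate 95% interval
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / N_VAL)
    print(f"{name}: {correct}/{N_VAL} correct, accuracy {acc:.3f} ± {half_width:.3f}")
```

Under these assumptions the Random Forest and XGBoost intervals overlap, while the neural network's interval sits below both.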

Conclusion¶